Abstract

In this paper, we propose an algorithm based on the
concepts of genetic algorithms that uses an
estimate of the probability distribution of
promising solutions in order to generate new
candidate solutions. To estimate
the distribution, techniques for modeling
multivariate data by Bayesian networks
are used. The proposed algorithm, called the Bayesian
optimization algorithm (BOA), identifies,
reproduces and mixes building blocks up to
a specified order. It is independent of the
ordering of the variables in the strings representing
the solutions. Moreover, prior information
about the problem can be incorporated
into the algorithm, although prior
information is not essential. Preliminary experiments
show that the BOA outperforms
the simple genetic algorithm even on decomposable
functions with tight building blocks
as the problem size grows.



1 INTRODUCTION 

Recently, there has been a growing interest in opti- 
mization methods that explicitly model the good so- 
lutions found so far and use the constructed model to 
guide the further search (Baluja, 1994; Harik et al.,
1997; Mühlenbein & Paaß, 1996; Mühlenbein et al.,
1998; Pelikan & Mühlenbein, 1999). This line of research
in stochastic optimization was strongly motivated
by results achieved in the field of evolutionary
computation. However, the connection between these 
two areas has sometimes been obscured. Moreover, the 
capabilities of model building have often been insuffi- 
ciently powerful to solve hard optimization problems. 

The purpose of this paper is to introduce an algorithm 
that uses techniques for estimating the joint distribution
of multinomial data by Bayesian networks in order
to generate new solutions. The proposed algorithm
extends existing methods in order to solve more diffi- 
cult classes of problems more efficiently and reliably. 
By covering interactions of higher order, the disrup- 
tion of identified partial solutions is prevented. Prior 
information from various sources can be used. The 
combination of information from the set of good so- 
lutions and the prior information about a problem is 
used to estimate the distribution. Preliminary experi- 
ments with uniformly-scaled additively decomposable 
problems with non-overlapping building blocks indi- 
cate that the proposed algorithm is able to solve all 
tested problems in close to linear time with respect to 
the number of fitness evaluations until convergence. 

In Section 2, the background needed to understand 
the motivation and basic principles of the discussed 
methods is provided. In Section 3, the Bayesian op- 
timization algorithm (BOA) is introduced. In subse- 
quent sections, the structure of Bayesian networks and 
the techniques used in the BOA to construct the net- 
work for a given data set and to use the constructed 
network to generate new instances are described. The 
results of the experiments are presented in Section 6. 
The conclusions are provided in Section 7. 

2 BACKGROUND 

Genetic algorithms (GAs) are optimization methods 
loosely based on the mechanics of artificial selection 
and genetic recombination operators. Most of the theory
of genetic algorithms deals with the so-called building
blocks (BBs) (Goldberg, 1989). Building blocks are
partial solutions of a problem. The genetic
algorithm implicitly manipulates a large number
of building blocks by mechanisms of selection and re- 
combination. It reproduces and mixes building blocks. 
However, a fixed mapping from the space of solutions
into the internal representation of solutions in the algorithm
and simple two-parent recombination operators
soon proved insufficiently powerful even for
problems that are composed of simpler partial subproblems.
General, fixed, problem-independent recombination
operators often break partial solutions,
which can lead to losing them and converging
to a local optimum. Two crucial factors of GA
success, proper growth and mixing of good building
blocks, are often not achieved (Thierens, 1995). Various
attempts to prevent the disruption of important
building blocks have been made recently and are briefly
discussed in the remainder of this section.

There are two major approaches to resolve the prob- 
lem of building-block disruption. The first approach 
is based on manipulating the representation of solu- 
tions in the algorithm in order to make the interact- 
ing components of partial solutions less likely to be 
broken by recombination operators. Various reorder- 
ing and mapping operators were used. However, re- 
ordering operators are often too slow and lose the race 
against selection, resulting in premature convergence 
to low-quality solutions. Reordering is not sufficiently 
powerful in order to ensure a proper mixing of partial 
solutions before these are lost. This line of research 
has resulted in algorithms which evolve the representation
of a problem among individual solutions, e.g.
the messy genetic algorithm (mGA) (?), the gene expression
messy genetic algorithm (GEMGA) (Bandyopadhyay
et al., 1998), the linkage learning genetic
algorithm (LLGA) (Harik & Goldberg, 1996), or the
linkage identification by non-linearity checking genetic
algorithm (LINC-GA) (Munetomo & Goldberg,



A different way to cope with the disruption of partial 
solutions is to change the basic principle of recombina- 
tion. In the second approach, instead of implicit repro- 
duction of important building blocks and their mixing 
by selection and two-parent recombination operators, 
new solutions are generated by using the information 
extracted from the entire set of promising solutions. 

Global information about the set of promising solu- 
tions can be used to estimate their distribution and 
new candidate solutions can be generated according 
to this estimate. A general scheme of the algorithms 
based on this principle is called the estimation of distribution
algorithm (EDA) (Mühlenbein & Paaß, 1996).
In EDAs, better solutions are selected from an ini- 
tially randomly generated population of solutions like 
in the simple GA. The distribution of the selected set 
of solutions is estimated. New solutions are generated 
according to this estimate. The new solutions are then 
added into the original population, replacing some of 
the old ones. The process is repeated until the termination
criteria are met. However, estimating the
distribution is not an easy task. There is a trade-off
between the accuracy of the estimation and its computational
cost.

The simplest way to estimate the distribution of 
good solutions is to consider each variable in a 
problem independently and generate new solutions 
by only preserving the proportions of the values of 
all variables independently of the remaining solu- 
tions. This is the basic principle of the population 
based incremental learning (PBIL) algorithm (Baluja,
1994), the compact genetic algorithm (cGA) (Harik
et al., 1997), and the univariate marginal distribution
algorithm (UMDA) (Mühlenbein & Paaß, 1996).
There is theoretical evidence that the UMDA approximates
the behavior of the simple GA with uniform
crossover (Mühlenbein, 1997). It reproduces and
mixes the building blocks of order one very efficiently.
The theory of UMDA based on the techniques of quantitative
genetics can be found in Mühlenbein (1997).
Some analyses of PBIL can be found in Kvasnicka et al. 
(1996). 
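
The univariate scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any of the cited algorithms: the function name `umda`, its parameters, the use of truncation selection, and the OneMax test problem (fitness equal to the number of ones) are all assumptions made for the example.

```python
import random

def umda(fitness, n_bits, pop_size=100, generations=50, seed=0):
    """Univariate EDA sketch: every string position is modeled independently."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Select the better half of the population as the promising set.
        pop.sort(key=fitness, reverse=True)
        selected = pop[: pop_size // 2]
        # Univariate model: proportion of ones at each position.
        p_one = [sum(s[i] for s in selected) / len(selected) for i in range(n_bits)]
        # Sample new strings position by position and replace the worse half.
        offspring = [[1 if rng.random() < p else 0 for p in p_one]
                     for _ in range(pop_size - len(selected))]
        pop = selected + offspring
    return max(pop, key=fitness)

# OneMax: the fitness of a string is its number of ones.
best = umda(sum, n_bits=20)
```

Because only the per-position proportions are preserved, any dependence between positions in the selected set is lost when new strings are sampled, which is the limitation discussed next.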

The PBIL, cGA, and UMDA algorithms work very well
for problems with no significant interactions among
variables (Mühlenbein, 1997; Harik et al., 1997; Pelikan
& Mühlenbein, 1999). However, partial solutions
of order more than one are disrupted, and therefore
these algorithms have great difficulty
solving problems with interactions among the variables.
The first attempts to solve this problem were based on
covering some pairwise interactions, e.g. the incremental
algorithm using the so-called dependency trees
as a distribution estimate (Baluja & Davies, 1997),
the population-based MIMIC algorithm using simple
chain distributions (De Bonet et al., 1997), or the bivariate
marginal distribution algorithm (BMDA) (Pelikan
& Mühlenbein, 1999). In the algorithms based
on covering pairwise interactions, the reproduction of
building blocks of order one is guaranteed. Moreover,
the disruption of some important building blocks of
order two is prevented. Important building blocks of
order two are identified using various statistical methods.
Mixing of building blocks of order one and two is
guaranteed assuming the independence of the remaining
groups of variables.

However, covering only pairwise interactions has been
shown to be insufficient to solve problems with interactions
of higher order efficiently (Pelikan & Mühlenbein,
1999). Covering pairwise interactions still does not
preserve higher order partial solutions. Moreover, interactions
of higher order do not necessarily imply
pairwise interactions that can be detected at the level
of partial solutions of order two.

In the factorized distribution algorithm (FDA) 
(Mühlenbein et al., 1998), a factorization of the distribution
is used for generating new solutions. The distribution
factorization is a conditional distribution constructed
by analyzing the problem decomposition. The
FDA is capable of covering the interactions of higher 
order and combining important partial solutions effec- 
tively. It works very well on additively decomposable 
problems. The theory of UMDA can be used in order 
to estimate the time to convergence in the FDA. 

However, the FDA requires prior information
about the problem in the form of a problem decomposition
and its factorization. As input, the algorithm
receives complete or approximate information about the
structure of a problem. Unfortunately, the exact distribution
factorization is often not available without
computationally expensive problem analysis. Moreover,
the use of an approximate distribution according
to the current state of information represented by the
set of promising solutions can be very effective even if
it is not a valid distribution factorization. Nevertheless,
by providing sufficient conditions for the distribution
estimate that ensure fast and reliable convergence
on decomposable problems, the FDA is of great theoretical
value. Moreover, for problems for which the
factorization of the distribution is known, the FDA is
a very powerful optimization tool.

The algorithm proposed in this paper is also capable of 
covering higher order interactions. It uses techniques 
from the field of modeling data by Bayesian networks 
in order to estimate the joint distribution of promising 
solutions. The class of distributions that are consid- 
ered is identical to the class of conditional distribu- 
tions used in the FDA. Therefore, the theory of the 
FDA can be used in order to demonstrate the power 
of the proposed algorithm to solve decomposable prob- 
lems. However, unlike the FDA, our algorithm does 
not require any prior information about the problem. 
It discovers the structure of a problem on the fly. It 
identifies, reproduces and mixes building blocks up to 
a specified order very efficiently. 

In this paper, the solutions will be represented by bi- 
nary strings of fixed length. However, the described 
techniques can be easily extended for strings over any 
finite alphabet. String positions will be numbered se- 
quentially from left to right, starting with the posi- 
tion 0. 



3 BAYESIAN OPTIMIZATION 
ALGORITHM 

This section introduces an algorithm that uses tech- 
niques for modeling data by Bayesian networks to es- 
timate the joint distribution of promising solutions 
(strings). This estimate is used to generate new can- 
didate solutions. The proposed algorithm is called the 
Bayesian optimization algorithm (BOA). The BOA 
covers both the UMDA and the BMDA and extends
them to cover the interactions of higher order. The order
of interactions that will be taken into account can
be given as input to the algorithm. The combination
of prior information and the set of promising solutions 
is used to estimate the distribution. Prior information 
about the structure of a problem as well as the infor- 
mation represented by the set of high-quality solutions 
can be incorporated into the algorithm. The ratio be- 
tween the prior information and the information ac- 
quired during the run used to generate new solutions 
can be controlled. The BOA fills the gap between the 
fully informed FDA and totally uninformed black-box 
optimization methods. Prior information is not essen- 
tial. 

In the BOA, the first population of strings is gener- 
ated at random. From the current population, the 
better strings are selected. Any selection method can 
be used. A Bayesian network that fits the selected 
set of strings is constructed. Any metric as a mea- 
sure for quality of networks and any search algorithm 
can be used to search over the networks in order to 
maximize the value of the used metric. New strings 
are generated using the joint distribution encoded by 
the constructed network. The new strings are added 
into the old population, replacing some of the old ones. 
The pseudo-code of the BOA follows: 

The Bayesian Optimization Algorithm (BOA)

(1) set t ← 0
    randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and
    constraints
(4) generate a set of new strings O(t) according to the
    joint distribution encoded by B
(5) create a new population P(t+1) by replacing some
    strings from P(t) with O(t)
    set t ← t + 1
(6) if the termination criteria are not met, go to (2)



In the following section, Bayesian networks and the
techniques for their construction and use will be described.

4 BAYESIAN NETWORKS 

Bayesian networks (Howard & Matheson, 1981; Pearl, 
1988) are often used for modeling multinomial data 
with both discrete and continuous variables. A 
Bayesian network encodes the relationships between 
the variables contained in the modeled data. It repre- 
sents the structure of a problem. Bayesian networks 
can be used to describe the data as well as to generate 
new instances of the variables with similar properties 
as those of the given data. Each node in the network corresponds
to one variable. In this text, $X_i$ denotes both the variable and
the node corresponding to this variable. Each variable corresponds
to one position in the strings representing the solutions ($X_i$ corresponds
to the ith position in a string). The relationship between
two variables is represented by an edge between
the two corresponding nodes. The edges in Bayesian
networks can be either directed or undirected. In this 
paper, only Bayesian networks represented by directed 
acyclic graphs will be considered. The modeled data 
sets will be defined within discrete domains. 

Mathematically, an acyclic Bayesian network with di- 
rected edges encodes a joint probability distribution. 
This can be written as

$$p(X) = \prod_{i=0}^{n-1} p(X_i \mid \Pi_{X_i}), \tag{1}$$

where $X = (X_0, \ldots, X_{n-1})$ is a vector of variables,
$\Pi_{X_i}$ is the set of parents of $X_i$ in the network (the set
of nodes from which there exists an edge to $X_i$) and
$p(X_i \mid \Pi_{X_i})$ is the conditional probability of $X_i$ conditioned
on the variables $\Pi_{X_i}$. This distribution can be
used to generate new instances using the marginal and
conditional probabilities in a modeled data set.
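
As a small illustration of Equation 1, consider a hypothetical network over three binary variables with edges from $X_0$ to $X_1$ and from $X_0$ to $X_2$. This structure is an assumption chosen only for the example. Here $X_0$ has no parents, and $X_0$ is the only parent of both $X_1$ and $X_2$, so the joint distribution factorizes as

```latex
p(X_0, X_1, X_2) = p(X_0)\, p(X_1 \mid X_0)\, p(X_2 \mid X_0)
```

For binary variables this factorization is specified by $1 + 2 + 2 = 5$ probabilities, compared with $2^3 - 1 = 7$ for an unrestricted joint distribution over the three variables.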

The following sections discuss how to learn the network 
structure if this is not given by the user, and how to 
use the network to generate new candidate solutions. 

4.1 CONSTRUCTING THE NETWORK 

There are two basic components of the algorithms 
for learning the network structure (Heckerman et al., 
1994). The first one is a scoring metric and the sec- 
ond one is a search procedure. A scoring metric is 
a measure of how well the network models the data. 
Prior knowledge about the problem can be incorporated
into the metric as well. A search procedure is
used to explore the space of all possible networks in
order to find the one (or a set of networks) with the
value of a scoring metric as high as possible. The space
of networks can be reduced by constraint operators.
Commonly used constraints restrict the networks to
have at most k incoming edges into each node. This
number directly influences the complexity of both the
network construction and its use for the generation
of new instances, as well as the order of interactions that can
be covered by the class of networks restricted in this
way.

4.1.1 Bayesian Dirichlet metric 

As a measure of the quality of networks, the so-called 
Bayesian Dirichlet (BD) metric (Heckerman et al., 
1994) can be used. This metric combines the prior 
knowledge about the problem and the statistical data 
from a given data set. The BD metric for a network
B given a data set D of size N, and the background
information $\xi$, denoted by $p(D, B \mid \xi)$, is defined as

$$p(D, B \mid \xi) = p(B \mid \xi) \prod_{i=0}^{n-1} \prod_{\pi_{X_i}} \frac{m'(\pi_{X_i})!}{(m'(\pi_{X_i}) + m(\pi_{X_i}))!} \prod_{x_i} \frac{(m'(x_i, \pi_{X_i}) + m(x_i, \pi_{X_i}))!}{m'(x_i, \pi_{X_i})!},$$

where $p(B \mid \xi)$ is the prior probability of the network
B, the product over $\pi_{X_i}$ runs over all instances of the
parents of $X_i$ and the product over $x_i$ runs over all
instances of $X_i$. By $m(\pi_{X_i})$, the number of instances in
D with variables $\Pi_{X_i}$ (the parents of $X_i$) instantiated
to $\pi_{X_i}$ is denoted. When the set $\Pi_{X_i}$ is empty, there
is one instance of $\Pi_{X_i}$ and the number of instances
with $\Pi_{X_i}$ instantiated to this instance is set to N. By
$m(x_i, \pi_{X_i})$, we denote the number of instances in D
that have both $X_i$ set to $x_i$ and $\Pi_{X_i}$ set to $\pi_{X_i}$.
The prior counts $m'(\pi_{X_i})$ are the sums of $m'(x_i, \pi_{X_i})$ over all $x_i$.

By numbers $m'(x_i, \pi_{X_i})$ and $p(B \mid \xi)$, prior information
about the problem is incorporated into the metric.
The $m'(x_i, \pi_{X_i})$ stands for prior information about the
number of instances that have $X_i$ set to $x_i$ and the set
of variables $\Pi_{X_i}$ instantiated to $\pi_{X_i}$. The prior probability
$p(B \mid \xi)$ of the network reflects how the measured
network resembles the prior network. By using a prior
network, the prior information about the structure of
a problem is incorporated into the metric. The prior
network can be set to an empty network when there
is no such information. In our implementation, we
set $p(B \mid \xi) = 1$ for all networks, i.e. all networks are
treated equally.

The numbers $m'(x_i, \pi_{X_i})$ can be set in various ways.
They can be set according to the prior information the
user has about the problem. When there is no prior
information, uninformative assignments can be used. In
the so-called K2 metric, for instance, the $m'(x_i, \pi_{X_i})$
coefficients are all simply set to 1 (Heckerman et al.,
1994). This assignment corresponds to having no prior
information about the problem. In the empirical part
of this paper we will use the K2 metric.

Since the factorials in the BD metric can grow to huge
numbers, usually a logarithm of the scoring metric is
used. The contribution of one node to the logarithm
of the metric can be computed in $O(2^k N)$ steps, where
k is the maximal number of incoming edges into each
node in the network and N is the size of the data
set (the number of instances). The increase of the
logarithm of the value of the BD metric
for an edge addition, edge reversal, or edge removal
can also be computed in time $O(2^k N)$, since
the total contribution of the nodes
whose set of parents has not changed remains
unchanged as well. Assuming that k is constant, we
get linear time complexity, O(N) with respect to
the size of the data set, for the computation of both
the contribution of one node and the increase in
the metric for an edge addition.
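
The per-node computation can be illustrated with a short Python sketch of the logarithm of the K2 metric. It is a sketch under stated assumptions, not the authors' code: the function name `log_k2_node` and its arguments are hypothetical, and the score is written in the standard K2 form of Cooper and Herskovits, whose bookkeeping of prior counts may differ from the factorial notation used in this paper. Logarithms of factorials are obtained through `math.lgamma`, using `lgamma(x + 1) = log(x!)`.

```python
from collections import Counter
from math import lgamma

def log_k2_node(data, i, parents, arity=2):
    """Log K2 contribution of node i for a given list of parent indices."""
    joint = Counter()   # m(x_i, pi): counts of a child value with a parent configuration
    margin = Counter()  # m(pi): counts of each observed parent configuration
    for row in data:    # one pass over the N instances
        pi = tuple(row[p] for p in parents)
        joint[(row[i], pi)] += 1
        margin[pi] += 1
    score = 0.0
    for pi, m_pi in margin.items():
        # log[(r - 1)! / (m(pi) + r - 1)!] with r = arity
        score += lgamma(arity) - lgamma(m_pi + arity)
        for x in range(arity):
            score += lgamma(joint[(x, pi)] + 1)  # log(m(x, pi)!)
    return score

# Second variable is a copy of the first one.
data = [(0, 0), (0, 0), (1, 1), (1, 1)]
without_parent = log_k2_node(data, 1, [])
with_parent = log_k2_node(data, 1, [0])
```

In this toy data set the score of the second variable is higher with the first variable as its parent than with no parents, which is the kind of difference the search procedure exploits. Parent configurations that never occur in the data contribute zero to the logarithm and can therefore be skipped.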



4.1.2 Searching for a Good Network 

In this section, the basic principles of algorithms that 
can be used for searching over the networks in order to 
maximize the value of a scoring metric are described. 
Only the classes of networks with a restricted number of
incoming edges, denoted by k, will be considered.

a) k = 0

This case is trivial. An empty network is the best one
(and the only one possible).

b) k = 1

For k = 1, there exists a polynomial algorithm for the
network construction (Heckerman et al., 1994). The
problem can be easily reduced to a special case of the
so-called maximal branching problem for which there
exists a polynomial algorithm (Edmonds, 1967).

c) k > 1

For k > 1 the problem gets much more complicated.
Although for k = 1 there exists a polynomial algorithm
for finding the best network, for k > 1 the problem
of determining the best network with respect to
a given metric is NP-complete for most Bayesian and
non-Bayesian metrics (Heckerman et al., 1994).

Various algorithms can be used in order to find a
good network, from a total enumeration to a blind
random search. Usually, due to their effectiveness in
this context, simple local search based methods are
used (Heckerman et al., 1994). A simple greedy algorithm,
local hill-climbing, or simulated annealing can
be used. Simple operations that can be performed on
a network include edge additions, edge reversals, and
edge removals. In each iteration, the operation that increases
the score of the network the most is applied. Only operations
that keep the network acyclic and with at most k
incoming edges into each of the nodes are allowed (i.e.,
the operations that do not violate the constraints).
The algorithms can start with an empty network, the
best network with at most one incoming edge into each node,
or a randomly generated network.

In our implementation, we have used a simple greedy
algorithm with only edge additions allowed. The algorithm
starts with an empty network. The time complexity
of this algorithm can be computed using the
time complexity of a simple edge addition and the
maximum number of edges that have to be processed.
With the BD metric, the overall time to construct
the network using the described greedy algorithm is
$O(k 2^k n^2 N + k n^3)$. Assuming that k is constant, we
get the overall time complexity $O(n^2 N + n^3)$.
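
A greedy search with edge additions only can be sketched as follows. The sketch is illustrative and makes several assumptions: the names `log_k2_node`, `creates_cycle`, and `greedy_network` are hypothetical, the node score is the standard K2 form, and every candidate edge is rescored from the data in each iteration, so the code does not attempt to reach the complexity bound stated above.

```python
from collections import Counter
from math import lgamma

def log_k2_node(data, i, parents, arity=2):
    """Log K2 contribution of node i (standard Cooper-Herskovits form)."""
    joint, margin = Counter(), Counter()
    for row in data:
        pi = tuple(row[p] for p in parents)
        joint[(row[i], pi)] += 1
        margin[pi] += 1
    score = 0.0
    for pi, m_pi in margin.items():
        score += lgamma(arity) - lgamma(m_pi + arity)
        score += sum(lgamma(joint[(x, pi)] + 1) for x in range(arity))
    return score

def creates_cycle(parents, src, dst):
    """True if adding the edge src -> dst would close a directed cycle."""
    # A cycle appears exactly when dst is already an ancestor of src.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_network(data, n, k):
    """Start from an empty network; add the best edge until nothing improves."""
    parents = {i: [] for i in range(n)}
    node_score = {i: log_k2_node(data, i, []) for i in range(n)}
    while True:
        best_gain, best_edge = 0.0, None
        for dst in range(n):
            if len(parents[dst]) >= k:
                continue  # constraint: at most k incoming edges per node
            for src in range(n):
                if src == dst or src in parents[dst] or creates_cycle(parents, src, dst):
                    continue
                gain = log_k2_node(data, dst, parents[dst] + [src]) - node_score[dst]
                if gain > best_gain:
                    best_gain, best_edge = gain, (src, dst)
        if best_edge is None:
            return parents
        src, dst = best_edge
        parents[dst].append(src)
        node_score[dst] += best_gain

# Toy data: variable 1 copies variable 0, variable 2 is independent of both.
data = [(a, a, b) for a in (0, 1) for b in (0, 1) for _ in range(5)]
net = greedy_network(data, n=3, k=2)
```

On the toy data the search links the two dependent variables with a single edge and leaves the independent variable without parents. The direction of that edge depends on tie-breaking, since both directions score equally.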

4.2 GENERATING NEW INSTANCES 

In this section, the generation of new instances using
a network B and the marginal frequencies for a few sets
of variables in the modeled data set will be described.
New instances are generated using the joint distribution
encoded by the network (see Equation 1).

First, the conditional probabilities of each possible instance
of each variable given all possible instances of
its parents in a given data set are computed. The conditional
probabilities are used to generate each new
instance. In each iteration, the nodes whose parents are
already fixed are generated using the corresponding
conditional probabilities. This is repeated until the
values of all variables are generated. Since the network
is acyclic, it is easy to see that the algorithm is
well defined.

The time complexity of generating an instance of all
variables is bounded by O(kn), where n is the number
of variables. Assuming that k is constant, the overall
time complexity is O(n).
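
The generation procedure can be sketched in Python as follows. The data layout is an assumption made for the example: `parents[i]` is the list of parent indices of node i, and `cpt[i]` maps each tuple of parent values to the probability that variable i equals 1. The function name `sample_instance` is hypothetical, and the repeated passes over the nodes are kept for clarity rather than to match the O(kn) bound.

```python
import random

def sample_instance(parents, cpt, rng):
    """Draw one binary string, sampling each node after all of its parents."""
    n = len(parents)
    values = {}
    while len(values) < n:
        for i in range(n):
            # A node is ready once every one of its parents has a value.
            if i not in values and all(p in values for p in parents[i]):
                pi = tuple(values[p] for p in parents[i])
                values[i] = 1 if rng.random() < cpt[i][pi] else 0
    return [values[i] for i in range(n)]

# Hypothetical network: X0 -> X1, with X1 always copying X0; X2 has no parents.
parents = {0: [], 1: [0], 2: []}
cpt = {
    0: {(): 0.5},
    1: {(0,): 0.0, (1,): 1.0},
    2: {(): 0.5},
}
rng = random.Random(1)
samples = [sample_instance(parents, cpt, rng) for _ in range(50)]
```

Because the graph is acyclic, every pass of the outer loop assigns at least one new variable, so the loop terminates. Ordering the nodes topologically once in advance would avoid the repeated passes.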

5 DECOMPOSABLE FUNCTIONS 

A function is additively decomposable of a certain order
if it can be written as the sum of simpler functions
defined over the subsets of variables, each of cardinality
less than or equal to the order of the decomposition
(Mühlenbein et al., 1998; Pelikan & Mühlenbein,
1999). The problems defined by this class of functions
can be decomposed into smaller subproblems. However,
simple GAs have great difficulty solving
these decomposable problems with deceptive building
blocks when these are not mapped tightly onto the
strings representing the solutions (Thierens, 1995).

In general, the BOA with k ≥ 0 can cover interactions
of order k + 1. This does not mean
that all interactions in a problem that is order-(k + 1)
decomposable can be covered (e.g., 2D spin-glass systems
(Mühlenbein et al., 1998)). There is no straightforward
way to relate general decomposable problems
to the interactions that must be
taken into account (or to the order of building
blocks). By introducing overlapping among the sets
from the decomposition along with scaling of the contributions
of each of these sets according to some function
of problem size, the problem becomes very complex.
Nevertheless, the class of distributions the BOA
uses is very powerful for decomposable problems with
either overlapping or non-overlapping building blocks
of a bounded order. This has been confirmed by a
number of experiments with various test functions.

6 EXPERIMENTS

The experiments were designed in order to show the
behavior of the proposed algorithm only on non-overlapping
decomposable problems with uniformly
scaled deceptive building blocks. For all problems, the
scalability of the proposed algorithm is investigated.
In the following sections, the functions of unitation
used in the experiments will be described and the results
of the experiments will be presented.

6.1 FUNCTIONS OF UNITATION

A function of unitation is a function whose value depends
only on the number of ones in a binary input
string. The function values for the strings with the
same number of ones are equal. Several functions of
unitation can be additively composed in order to form
a more complex function. Let us have a function of
unitation $f_k$ defined for strings of length k. Then, the
function additively composed of l functions $f_k$ is defined
as

$$f(X) = \sum_{i=0}^{l-1} f_k(S_i), \tag{2}$$

where X is the set of n variables and $S_i$ for
$i \in \{0, \ldots, l-1\}$ are subsets of k variables from X. Sets
$S_i$ can be either overlapping or non-overlapping and
they can be mapped onto a string (the inner representation
of a solution) so that the variables from one
set are either mapped close to each other or spread
throughout the whole string. Each variable will be required
to contribute to the function through some of
the subfunctions. A function composed in this fashion
is clearly additively decomposable of the order of the
subfunctions it was composed with.

A deceptive function of order 3, denoted by
3-deceptive, is defined as

$$f_{3deceptive}(u) = \begin{cases} 0.9 & \text{if } u = 0 \\ 0.8 & \text{if } u = 1 \\ 0 & \text{if } u = 2 \\ 1 & \text{otherwise} \end{cases} \tag{3}$$

where u is the number of ones in an input string.

A trap function of order 5, denoted by trap-5, is defined
as

$$f_{trap5}(u) = \begin{cases} 4 - u & \text{if } u < 5 \\ 5 & \text{otherwise} \end{cases} \tag{4}$$

A bipolar deceptive function of order 6, denoted by
6-bipolar, is defined with the use of the 3-deceptive
function as follows

$$f_{6bipolar}(u) = f_{3deceptive}(|3 - u|) \tag{5}$$


6.2 RESULTS OF THE EXPERIMENTS 

For all problems, the average number of fitness evaluations
until convergence in 30 independent runs is
shown. For the 3-deceptive and trap-5 functions, the
population is said to have converged when the proportion
of some value on each position reaches 95%. This
criterion of convergence is applicable only for problems
with at most one global optimum and selection
schemes that do not force the algorithm to preserve
the diversity in a population (as, e.g., niching methods do).
For the 6-bipolar function, the population is said to
have converged when more than half of it consists of optimal
solutions. For all algorithms, the population sizes
for all problem instances have been determined empirically
as the minimal size with which the algorithms converge
to the optimum in all 30 independent runs. In all
runs, truncation selection with τ = 50% has been
used (the better half of the solutions is selected). Offspring
replace the worse half of the old population.
The crossover rate for the simple GA has been empirically
determined for each problem with one problem
instance. In the simple GA, the best results have been
achieved with the probability of crossover 100%. The
probability of flipping a single bit by mutation has
been set to 1%. In the BOA, no prior information but
the maximal order of interactions to be considered has
been incorporated into the algorithm.
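
The replacement scheme and the 95% convergence criterion described above can be sketched as follows. The function names and the list-of-lists population layout are assumptions made for the example. The sketch does not reproduce the experimental code.

```python
def truncation_step(population, offspring, fitness):
    """Keep the better half of the population; offspring replace the worse half."""
    ranked = sorted(population, key=fitness, reverse=True)
    half = len(ranked) // 2
    return ranked[:half] + offspring[: len(ranked) - half]

def has_converged(population, threshold=0.95):
    """True when, at every position, one bit value holds at least `threshold` of the population."""
    size = len(population)
    for position in zip(*population):  # iterate over string positions
        ones = sum(position)
        if max(ones, size - ones) / size < threshold:
            return False
    return True

population = [[0, 0], [1, 1], [1, 0], [0, 1]]
offspring = [[1, 1], [1, 1]]
next_population = truncation_step(population, offspring, fitness=sum)
```

As noted in the text, a per-position criterion of this kind is meaningful only when the population is expected to collapse onto a single optimum.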
Figure 1: Results for 3-deceptive Function.

Figure 2: Results for trap-5 Function.

Figure 3: Results for 6-bipolar Function.



In Figure 1, the results for the 3-deceptive function
are presented. In this function, the deceptive building
blocks are of order 3. The building blocks are non-overlapping
and mapped tightly onto strings. Therefore,
one-point crossover is not likely to disrupt them.
The looser the building blocks, the worse the
simple GA would perform. Since the building blocks
are deceptive, the computational requirements of the
simple GA with uniform crossover and the BOA with
k = 0 (i.e., the UMDA) grow exponentially and therefore
we do not present the results for these algorithms.
Some results for BMDA can be found in Pelikan and
Mühlenbein (1999). The BOA with k = 2 and the K2
metric performs the best of the compared algorithms
in terms of the number of function evaluations until
successful convergence. The simple GA with one-point
crossover performs worse than the BOA with k = 2 as
the problem size grows. For loose building blocks, the
simple GA with one-point crossover would require a
number of fitness evaluations growing exponentially
with the size of a problem (Thierens, 1995). On the
other hand, the BOA would perform the same since
it is independent of the variable ordering in a string.
The population sizes for the GA ranged from N = 400
for n = 30 to N = 7700 for n = 180. The population
sizes for the BOA ranged from N = 1000 for n = 30
to N = 7700 for n = 180.

In Figure 2, the results for the trap-5 function are 
presented. The building blocks are non-overlapping 
and they are again mapped tightly onto a string. The 
results for this function are similar to those for the 
3-deceptive function. The population sizes for the GA 
ranged from N = 600 for n = 30 to N = 8100 for
n = 180. The population sizes for the BOA ranged 
from N = 1300 for n = 30 to N = 11800 for n = 180. 

In Figure 3, the results for the 6-bipolar function are 
presented. The results for this function are similar 
to those for the 3-deceptive function. In addition to 
the faster convergence, the BOA discovers a number of
solutions out of the total of $2^{n/6}$ global optima of the 6-bipolar
function instead of converging to a single solution.
This effect could be further magnified by using niching
methods. The population sizes for the GA ranged from
N = 360 for n = 30 to N = 4800 for n = 180. The
population sizes for the BOA ranged from N = 900
for n = 30 to N = 5000 for n = 180.



7 CONCLUSIONS 

The experiments have shown that the proposed algorithm
outperforms the simple GA even on decomposable
problems with tight building blocks as the problem
size grows. The gap between the proposed algorithm
and the simple GA would widen significantly
for problems with loose building blocks. For
loose mapping the time requirements of the simple
GA grow exponentially with the problem size. On
the other hand, the BOA is independent of the ordering
of the variables in a string and therefore changing
this would not affect the performance of the algorithm.
The proposed algorithm also works very well for other
problems with highly overlapping building blocks, e.g.
spin-glasses, that are not discussed in this paper.